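We study the effect of normalization on the layers of deep neural networks of feed-forward type. A given layer $i$ with $n_{i}$ hidden units is allowed to be normalized by $1/n_{i}^{\gamma_{i}}$ with $\gamma_{i} \in [1/2, 1]$, and we study the effect of the choice of the $\gamma_{i}$ on the statistical behavior of the neural network's output (such as its variance) as well as on the test accuracy on the MNIST dataset. We find that, in terms of the variance of the neural network's output and the test accuracy, the best choice is to set the $\gamma_{i}$ equal to one, which is the mean-field scaling. We also find that this is particularly true for the outer layer, in that the neural network's behavior is more sensitive to the scaling of the outer layer than to the scaling of the inner layers. The mechanism for the mathematical analysis is an asymptotic expansion of the neural network's output. An important practical consequence of the analysis is that it provides a systematic and mathematically informed way to choose the learning rate hyperparameters. Such a choice guarantees that the neural network behaves in a statistically robust way as the $n_i$ grow.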
translated by 谷歌翻译
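The effect of the exponent $\gamma_i$ on output variance in the normalization abstract above can be illustrated with a deliberately simplified sketch. It assumes a single hidden layer, standard normal initialization, and a `tanh` activation, none of which come from the abstract; the function names are hypothetical. At initialization the output variance scales like $n^{1-2\gamma}$, so $\gamma = 1$ (mean-field scaling) gives a much smaller variance than $\gamma = 1/2$.

```python
import numpy as np

def scaled_output(x, n_hidden, gamma, rng):
    """Output of a one-hidden-layer network at random initialization,
    with the outer layer normalized by 1 / n_hidden**gamma."""
    w = rng.standard_normal(n_hidden)  # inner weights (assumed N(0, 1) init)
    c = rng.standard_normal(n_hidden)  # outer weights (assumed N(0, 1) init)
    return np.sum(c * np.tanh(w * x)) / n_hidden**gamma

def output_variance(n_hidden, gamma, n_trials=2000, x=1.0, seed=0):
    """Empirical variance of the output over independent initializations."""
    rng = np.random.default_rng(seed)
    samples = [scaled_output(x, n_hidden, gamma, rng) for _ in range(n_trials)]
    return float(np.var(samples))

# Compare the two extreme exponents of the interval for a wide layer.
for gamma in (0.5, 1.0):
    print(gamma, output_variance(n_hidden=1000, gamma=gamma))
```

This only probes the network at initialization; the abstract's claims about training dynamics and test accuracy are not reproduced here.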
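Pansharpening in remote sensing aims to obtain a high-resolution multispectral (HRMS) image directly by fusing a low-resolution multispectral (LRMS) image with a panchromatic (PAN) image. The main concern is how to effectively combine the rich spectral information of the LRMS image with the abundant spatial information of the PAN image. Recently, many deep-learning-based methods have been proposed for the pansharpening task. However, these methods usually have two main drawbacks: 1) they require HRMS images for supervised learning; and 2) they simply ignore the latent relation between the MS and PAN images and fuse them directly. To solve these problems, we propose a novel unsupervised network based on learnable degradation processes, called LDP-Net. A reblurring block and a graying block are designed to learn the corresponding degradation processes, respectively. In addition, a novel hybrid loss function is proposed to constrain both the spatial and the spectral consistency between the pansharpened image and the PAN and LRMS images at different resolutions. Experiments on WorldView2 and WorldView3 images show that the proposed LDP-Net can fuse PAN and LRMS images effectively without the help of HRMS samples, achieving promising performance in terms of both qualitative visual effects and quantitative metrics.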
Recent progress in geometric computer vision has shown significant advances in reconstruction and novel view rendering from multiple views by capturing the scene as a neural radiance field. Such approaches have changed the paradigm of reconstruction but need a plethora of views and do not make use of object shape priors. On the other hand, deep learning has shown how to use priors in order to infer shape from single images. Such approaches, though, require that the object is reconstructed in a canonical pose or assume that object pose is known during training. In this paper, we address the problem of how to compute equivariant priors for reconstruction from a few images, given the relative poses of the cameras. Our proposed reconstruction is $SE(3)$-gauge equivariant, meaning that it is equivariant to the choice of world frame. To achieve this, we make two novel contributions to light field processing: we define light field convolution and we show how it can be approximated by intra-view $SE(2)$ convolutions because the original light field convolution is computationally and memory-wise intractable; we design a map from the light field to $\mathbb{R}^3$ that is equivariant to the transformation of the world frame and to the rotation of the views. We demonstrate equivariance by obtaining robust results in roto-translated datasets without performing transformation augmentation.
This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or a perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to ``flattened frame embeddings'', yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into a long sequence that serves as a frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are directly loaded from a pretrained image-text CoCa model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
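The flattening step described above can be sketched in a few lines. This is a minimal illustration rather than the CoCa implementation: it assumes a single-head attention pooler with no learned key/value projections, and the names `attentional_pool` and `pool_video` are hypothetical. The point it shows is that a pooler defined over a set of image tokens accepts the flattened \(N \times T\) video tokens unchanged.

```python
import numpy as np

def attentional_pool(tokens, queries):
    """Single-head attention pooling: each query attends over all tokens.
    tokens: (L, D), queries: (Q, D) -> pooled: (Q, D)."""
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def pool_video(frame_tokens, queries):
    """Flatten (T, N, D) per-frame tokens into one (T*N, D) sequence, then
    reuse the image-level pooler without any modification."""
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)
    return attentional_pool(flat, queries)

rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((4, 6, 8))  # T=4 frames, N=6 tokens, D=8
queries = rng.standard_normal((1, 8))          # one pooling query (assumed)
video_embedding = pool_video(frame_tokens, queries)
print(video_embedding.shape)
```

Because the pooler is agnostic to sequence length, a one-frame video reduces exactly to the image case, which is why pretrained pooling weights can be loaded directly.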
Downsampling and feature extraction are essential procedures for 3D point cloud understanding. Existing methods are limited by the inconsistent point densities of different parts in the point cloud. In this work, we analyze the limitation of the downsampling stage and propose the pre-abstraction group-wise window-normalization module. In particular, the window-normalization method is leveraged to unify the point densities in different parts. Furthermore, the group-wise strategy is proposed to obtain multi-type features, including texture and spatial information. We also propose the pre-abstraction module to balance local and global features. Extensive experiments show that our module performs better on several tasks. In segmentation tasks on S3DIS (Area 5), the proposed module performs better on small object recognition, and the results have more precise boundaries than others. The recognition of the sofa and the column is improved from 69.2% to 84.4% and from 42.7% to 48.7%, respectively. The benchmarks are improved from 71.7%/77.6%/91.9% (mIoU/mAcc/OA) to 72.2%/78.2%/91.4%. The accuracies of 6-fold cross-validation on S3DIS are 77.6%/85.8%/91.7%. It outperforms the best model PointNeXt-XL (74.9%/83.0%/90.3%) by 2.7% on mIoU and achieves state-of-the-art performance. The code and models are available at https://github.com/DBDXSS/Window-Normalization.git.
Label shift has been widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate the impact of label shift, e.g., balancing the training data. However, these methods often consider the underparametrized regime, where the sample size is much larger than the data dimension. Research on the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that there exists a phase transition phenomenon: under a certain overparametrized regime, the classifier trained on imbalanced data outperforms its counterpart trained on reduced balanced data. Moreover, we investigate the impact of regularization on label shift: the aforementioned phase transition vanishes as the regularization becomes strong.
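For readers unfamiliar with the classifier being analyzed, a regularized Fisher Linear Discriminant can be sketched as follows. This is a generic textbook form with a ridge term $\lambda I$ added to the pooled covariance, not necessarily the exact estimator studied in the paper; the function name, the midpoint threshold (which assumes equal class priors), and the synthetic data are all assumptions. The ridge term is what keeps the linear solve well posed when the dimension exceeds the sample size.

```python
import numpy as np

def fit_regularized_fld(x0, x1, lam):
    """Direction w = (Sigma + lam * I)^{-1} (mu1 - mu0) with a midpoint bias.
    x0, x1: (n0, p) and (n1, p) samples of the two classes; lam > 0."""
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    centered = np.vstack([x0 - mu0, x1 - mu1])
    sigma = centered.T @ centered / (len(centered) - 2)  # pooled covariance
    w = np.linalg.solve(sigma + lam * np.eye(sigma.shape[0]), mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)  # midpoint threshold, assumes equal priors
    return w, b

# Overparametrized, imbalanced toy data: p = 60 features, 10 + 30 samples.
rng = np.random.default_rng(0)
p = 60
x0 = rng.standard_normal((10, p))
x1 = rng.standard_normal((30, p)) + 0.5
w, b = fit_regularized_fld(x0, x1, lam=1.0)
predict = lambda x: (x @ w + b > 0).astype(int)
print(predict(x1).mean())
```

Increasing `lam` shrinks the influence of the singular sample covariance, which is the regime where the abstract reports the phase transition disappearing.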
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that can map two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is often not fully realized, mainly due to the gap between a large number of parameters and a relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. More interestingly, we can borrow imperfect category names, or even names from a foreign language, to improve the few-shot classification performance compared with random initialization. With the proposed category name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37\% on ImageNet and 96.08\% on Stanford Cars, both using five-shot learning). We also investigate and analyze when the benefit of category names diminishes and how to use distillation to improve the performance of smaller models, providing guidance for future research.
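The idea of initializing the classification head from category names can be sketched as below. This is an illustrative toy, not the paper's code: real text and image encoders are replaced by synthetic vectors in a shared embedding space, and every name in the block (`init_head_from_names`, `predict`, the noise levels) is an assumption. It compares a head whose rows are normalized name embeddings against a randomly initialized head, before any finetuning.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def init_head_from_names(name_embeddings):
    """Head whose row k is the normalized text embedding of class k's name."""
    return l2_normalize(np.asarray(name_embeddings, dtype=float))

def predict(image_features, head):
    """Cosine-similarity logits followed by argmax over classes."""
    logits = l2_normalize(image_features) @ head.T
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
n_classes, dim = 5, 32
prototypes = rng.standard_normal((n_classes, dim))  # stand-in shared space
# Stand-ins for encoder outputs: noisy views of the class prototypes.
name_embeddings = prototypes + 0.2 * rng.standard_normal((n_classes, dim))
labels = rng.integers(0, n_classes, size=200)
image_features = prototypes[labels] + 0.2 * rng.standard_normal((200, dim))

head_names = init_head_from_names(name_embeddings)
head_random = l2_normalize(rng.standard_normal((n_classes, dim)))
acc_names = float((predict(image_features, head_names) == labels).mean())
acc_random = float((predict(image_features, head_random) == labels).mean())
print(acc_names, acc_random)
```

In a few-shot setting the name-initialized head would then be finetuned on the labeled examples, starting from a much better point than the random head.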
We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves $\textbf{64.5}$ mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard https://paperswithcode.com/sota/object-detection-on-coco
Deep learning (DL) methods have been widely applied to anomaly-based network intrusion detection systems (NIDSs) to detect malicious traffic. To expand the usage scenarios of DL-based methods, the federated learning (FL) framework allows multiple users to train a global model while respecting individual data privacy. However, it has not yet been systematically evaluated how robust FL-based NIDSs are against existing privacy attacks under existing defenses. To address this issue, we propose two privacy evaluation metrics designed for FL-based NIDSs: (1) a privacy score that evaluates the similarity between the original and recovered traffic features using reconstruction attacks, and (2) an evasion rate against NIDSs using a Generative Adversarial Network-based adversarial attack with the reconstructed benign traffic. We conduct experiments to show that existing defenses provide so little protection that the corresponding adversarial traffic can even evade the SOTA NIDS Kitsune. To defend against such attacks and build a more robust FL-based NIDS, we further propose FedDef, a novel optimization-based input perturbation defense strategy with a theoretical guarantee. It achieves both high utility by minimizing the gradient distance and strong privacy protection by maximizing the input distance. We experimentally evaluate four existing defenses on four datasets and show that our defense outperforms all the baselines in terms of privacy protection, with up to 7 times higher privacy score, while keeping the model accuracy loss within 3% under the optimal parameter combination.
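For decades, the prediction of credit default risk has been an important research area. Traditionally, logistic regression has been widely regarded as the standard solution because of its accuracy and interpretability. As a recent trend, researchers tend to use more complex and advanced machine learning methods to improve prediction accuracy. Although some non-linear machine learning methods have better predictive power, they are generally considered to lack the interpretability that financial regulators require. As a result, they have not been widely applied in credit risk assessment. We introduce a neural network with a selective option that improves interpretability by identifying which parts of the data can be explained by a linear model. We find that, for most datasets, logistic regression is sufficient and achieves reasonable accuracy. Meanwhile, for some specific portions of the data, a shallow neural network model can improve accuracy without significantly sacrificing interpretability.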